Contributing Factors to COVID-19 Contraction and Death in the United States
Authors: Francis Oparaocha and Kenneth Somerville
We think it is safe to say that the coronavirus has affected many aspects of our lives. From stay-at-home orders to wearing masks wherever you go, life is not the same. For this project, we look at a few of the factors that may contribute to the contraction of the coronavirus and to deaths from it within the US. The factors we look at are time, each state's population, wealth distribution, population density, median income, and the percentage of people in each social class. This is not a comprehensive list: we do not include factors such as public policy related to COVID-19, current events, or the ages of people contracting or dying from COVID-19. The question we are answering is: is there a correlation between these factors and the number of COVID-19 cases and deaths in the United States?
Here are some of the key terms that we will use in this project:
COVID-19:
COVID-19 is caused by a coronavirus called SARS-CoV-2. Older adults and people who have severe underlying medical conditions like heart or lung disease or diabetes seem to be at higher risk for developing more serious complications from COVID-19 illness.
Population Density:
Population density is a measurement of population per unit area, or exceptionally unit volume; it is a quantity of type number density.
Median Income:
The median income is the income amount that divides a population into two equal groups, half having an income above that amount, and half having an income below that amount.
Upper Class:
The economic group with the greatest wealth and power in society. For each state, upper-class households are those whose income is more than twice the median household income.
Middle Class:
The economic group between the upper and lower classes. For each state, middle-class households are those earning between two-thirds and twice the median household income.
Lower Class:
The economic group with the least wealth and power in society. For each state, lower-class households are those whose annual household income is less than two-thirds of the state median.
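Taking the middle-class band above (two-thirds to twice the median) as the reference, households below it fall in the lower class and households above it in the upper class. A minimal sketch of that classification follows; the function name `income_class` and the sample incomes are hypothetical and chosen only for illustration.

```python
from statistics import median

def income_class(household_income, median_income):
    """Classify a household relative to the median household income."""
    lower_cutoff = median_income * 2 / 3   # below this: lower class
    upper_cutoff = median_income * 2       # above this: upper class
    if household_income < lower_cutoff:
        return "lower"
    if household_income <= upper_cutoff:
        return "middle"
    return "upper"

# Hypothetical household incomes, for illustration only.
sample_incomes = [20_000, 45_000, 60_000, 75_000, 200_000]
sample_median = median(sample_incomes)  # half the sample is above, half below
classes = [income_class(x, sample_median) for x in sample_incomes]
print(sample_median, classes)
```

The same median is used for both cutoffs, so the class percentages for a state always depend on that state's own median income.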
LIBRARIES:
In this project, we used several data science libraries such as pandas and NumPy to load and store US COVID-19, population, and income data from online sources. We used Matplotlib and Seaborn to represent our data through different types of graphs. Finally, we used scikit-learn for linear modeling and to show linear regression on graphs.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.linear_model import LinearRegression
To start off, we created dataframes using data from the New York Times and the US Census to get the daily COVID cases and deaths for each county in America and the most recently reported population for each state. The numbers for COVID cases and deaths were cross-checked with other reputable sources such as state health department websites. Below is the code and the combined table of the data.
all_counties = pd.read_csv('https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv')
populations = pd.read_csv('https://www2.census.gov/programs-surveys/popest/datasets/2010-2019/national/totals/nst-est2019-popchg2010_2019.csv')
# Keep the name and 2019 estimate; the first five rows are aggregate (non-state) entries
populations = populations[['NAME', 'POPESTIMATE2019']].drop([0,1,2,3,4]).reset_index(drop=True)
areas = pd.read_csv('https://raw.githubusercontent.com/jakevdp/data-USstates/master/state-areas.csv')
# Match each area to its population by state name. Pairing the two tables by row
# position mismatches every state after Pennsylvania, because Puerto Rico sits
# at a different position in each file.
pop_density = populations.merge(areas, left_on='NAME', right_on='state', how='left').drop(columns='state')
pop_density['Population Density'] = pop_density['POPESTIMATE2019'] / pop_density['area (sq. mi)']
pop_density = pop_density.rename(columns={'NAME': 'State'})
pop_density
| | State | POPESTIMATE2019 | area (sq. mi) | Population Density |
|---|---|---|---|---|
| 0 | Alabama | 4903185 | 52423 | 93.5312 |
| 1 | Alaska | 731545 | 656425 | 1.11444 |
| 2 | Arizona | 7278717 | 114006 | 63.845 |
| 3 | Arkansas | 3017804 | 53182 | 56.7448 |
| 4 | California | 39512223 | 163707 | 241.359 |
| 5 | Colorado | 5758736 | 104100 | 55.3193 |
| 6 | Connecticut | 3565287 | 5544 | 643.089 |
| 7 | Delaware | 973764 | 1954 | 498.344 |
| 8 | District of Columbia | 705749 | 68 | 10378.7 |
| 9 | Florida | 21477737 | 65758 | 326.618 |
| 10 | Georgia | 10617423 | 59441 | 178.621 |
| 11 | Hawaii | 1415872 | 10932 | 129.516 |
| 12 | Idaho | 1787065 | 83574 | 21.383 |
| 13 | Illinois | 12671821 | 57918 | 218.789 |
| 14 | Indiana | 6732219 | 36420 | 184.85 |
| 15 | Iowa | 3155070 | 56276 | 56.0642 |
| 16 | Kansas | 2913314 | 82282 | 35.4065 |
| 17 | Kentucky | 4467673 | 40411 | 110.556 |
| 18 | Louisiana | 4648794 | 51843 | 89.6706 |
| 19 | Maine | 1344212 | 35387 | 37.986 |
| 20 | Maryland | 6045680 | 12407 | 487.28 |
| 21 | Massachusetts | 6892503 | 10555 | 653.008 |
| 22 | Michigan | 9986857 | 96810 | 103.159 |
| 23 | Minnesota | 5639632 | 86943 | 64.8659 |
| 24 | Mississippi | 2976149 | 48434 | 61.4475 |
| 25 | Missouri | 6137428 | 69709 | 88.0436 |
| 26 | Montana | 1068778 | 147046 | 7.26832 |
| 27 | Nebraska | 1934408 | 77358 | 25.0059 |
| 28 | Nevada | 3080156 | 110567 | 27.8578 |
| 29 | New Hampshire | 1359711 | 9351 | 145.408 |
| 30 | New Jersey | 8882190 | 8722 | 1018.37 |
| 31 | New Mexico | 2096829 | 121593 | 17.2447 |
| 32 | New York | 19453561 | 54475 | 357.11 |
| 33 | North Carolina | 10488084 | 53821 | 194.87 |
| 34 | North Dakota | 762062 | 70704 | 10.7782 |
| 35 | Ohio | 11689100 | 44828 | 260.754 |
| 36 | Oklahoma | 3956971 | 69903 | 56.6066 |
| 37 | Oregon | 4217737 | 98386 | 42.8693 |
| 38 | Pennsylvania | 12801989 | 46058 | 277.954 |
| 39 | Rhode Island | 1059361 | 1545 | 685.671 |
| 40 | South Carolina | 5148714 | 32007 | 160.862 |
| 41 | South Dakota | 884659 | 77121 | 11.4711 |
| 42 | Tennessee | 6829174 | 42146 | 162.036 |
| 43 | Texas | 28995881 | 268601 | 107.952 |
| 44 | Utah | 3205958 | 84904 | 37.7598 |
| 45 | Vermont | 623989 | 9615 | 64.8975 |
| 46 | Virginia | 8535519 | 42769 | 199.573 |
| 47 | Washington | 7614893 | 71303 | 106.796 |
| 48 | West Virginia | 1792147 | 24231 | 73.9609 |
| 49 | Wisconsin | 5822434 | 65503 | 88.8881 |
| 50 | Wyoming | 578759 | 97818 | 5.91669 |
| 51 | Puerto Rico | 3193694 | 3515 | 908.59 |
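Population density is only meaningful if each state's population is divided by that same state's area, so the two tables have to be matched by state name rather than by row position. Below is a minimal sketch of a key-based join, using the population and area figures for the first three states in the table above. The frame names `pops` and `land` are hypothetical, and `validate='one_to_one'` is an optional pandas check that raises an error if a state appears more than once on either side.

```python
import pandas as pd

pops = pd.DataFrame({
    "State": ["Alabama", "Alaska", "Arizona"],
    "POPESTIMATE2019": [4903185, 731545, 7278717],
})
# Same states, deliberately listed in a different order.
land = pd.DataFrame({
    "state": ["Arizona", "Alabama", "Alaska"],
    "area (sq. mi)": [114006, 52423, 656425],
})

# Join on the state name so row order in either table does not matter.
joined = pops.merge(land, left_on="State", right_on="state",
                    how="left", validate="one_to_one").drop(columns="state")
joined["Population Density"] = joined["POPESTIMATE2019"] / joined["area (sq. mi)"]
print(joined)
```

Even though `land` is shuffled, every state ends up with its own area, which a position-based `pd.concat(..., axis=1)` would not guarantee.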
# Sum the county rows into one row per state per day
cases_state_day = all_counties.groupby(['date','state'], as_index=False)[['cases','deaths']].sum()
# Split the states/territories into 6 groups so each graph stays readable
state_list = sorted(cases_state_day['state'].unique())
state_list = np.array_split(state_list, 6)
#list of dataframes. each dataframe contains the number of covid cases, deaths,
#for each day for one group of US states/territories
df_list = []
for group in state_list:
    # DataFrame.append was removed in pandas 2.0, so filter with isin instead
    df_list.append(cases_state_day[cases_state_day['state'].isin(group)])
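The New York Times file reports cumulative counts, so each state's line in the graphs can only rise or stay flat, apart from occasional data corrections. If daily new cases are of interest, one way to derive them is to difference the cumulative series within each state. Here is a minimal sketch on made-up numbers; the column name `new_cases` is hypothetical and not part of the original dataset.

```python
import pandas as pd

# Made-up cumulative counts in the same layout as cases_state_day.
toy = pd.DataFrame({
    "date":  ["2020-03-01", "2020-03-02", "2020-03-03"] * 2,
    "state": ["A"] * 3 + ["B"] * 3,
    "cases": [1, 4, 9, 0, 2, 2],
})

toy = toy.sort_values(["state", "date"])
# diff() within each state turns cumulative totals into day-over-day changes.
# The first day has no previous value, so it keeps its own total.
toy["new_cases"] = (
    toy.groupby("state")["cases"].diff().fillna(toy["cases"]).astype(int)
)
print(toy)
```

A flat stretch in the cumulative line corresponds to zero new cases, which makes slowdowns easier to see than on the cumulative graphs.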
Using the census.gov website again, we collected the median household income of each state and combined it into a pandas dataframe.
url1 = 'https://drive.google.com/file/d/1Ou29O44sa_iOTEXPEKYi8Y6-QZq_a8AB/view?usp=sharing'
url2='https://drive.google.com/uc?id=' + url1.split('/')[-2]
incomes = pd.read_excel(url2,header=1)
incomes = incomes.drop([0,1,2,3,4,111,112])
incomes = incomes.rename(columns={'Table H-8. Median Household Income by State: 1984 to 2019': 'State'})
incomes = incomes.rename(columns={'Unnamed: 1': 'Median Income'})
incomes = pd.concat([incomes[['State']],incomes[['Median Income']]],axis=1)
incomes = incomes.drop_duplicates()
incomes = incomes.drop([56,57,58,59])
incomes = incomes.sort_values(by=['State'])
incomes = incomes.reset_index(drop=True)
incomes
| | State | Median Income |
|---|---|---|
| 0 | Alabama | 56200 |
| 1 | Alaska | 78394 |
| 2 | Arizona | 70674 |
| 3 | Arkansas | 54539 |
| 4 | California | 78105 |
| 5 | Colorado | 72499 |
| 6 | Connecticut | 87291 |
| 7 | Delaware | 74194 |
| 8 | District of Columbia | 93111 |
| 9 | Florida | 58368 |
| 10 | Georgia | 56628 |
| 11 | Hawaii | 88006 |
| 12 | Idaho | 65988 |
| 13 | Illinois | 74399 |
| 14 | Indiana | 66693 |
| 15 | Iowa | 66054 |
| 16 | Kansas | 73151 |
| 17 | Kentucky | 55662 |
| 18 | Louisiana | 51707 |
| 19 | Maine | 66546 |
| 20 | Maryland | 95572 |
| 21 | Massachusetts | 87707 |
| 22 | Michigan | 64119 |
| 23 | Minnesota | 81426 |
| 24 | Mississippi | 44787 |
| 25 | Missouri | 60597 |
| 26 | Montana | 60195 |
| 27 | Nebraska | 73071 |
| 28 | Nevada | 70906 |
| 29 | New Hampshire | 86900 |
| 30 | New Jersey | 87726 |
| 31 | New Mexico | 53113 |
| 32 | New York | 71855 |
| 33 | North Carolina | 61159 |
| 34 | North Dakota | 70031 |
| 35 | Ohio | 64663 |
| 36 | Oklahoma | 59397 |
| 37 | Oregon | 74413 |
| 38 | Pennsylvania | 70582 |
| 39 | Rhode Island | 70151 |
| 40 | South Carolina | 62028 |
| 41 | South Dakota | 64255 |
| 42 | Tennessee | 56627 |
| 43 | Texas | 67444 |
| 44 | Utah | 84523 |
| 45 | Vermont | 74305 |
| 46 | Virginia | 81313 |
| 47 | Washington | 82454 |
| 48 | West Virginia | 53706 |
| 49 | Wisconsin | 67355 |
| 50 | Wyoming | 65134 |
PREDICTIONS FOR COVID CASES OVER TIME:
The information used in these graphs is from https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv, a dataset with the cumulative COVID cases and deaths per date for each county in the United States. The dataset is published by The New York Times.
These graphs show a timeline of COVID-19 cases and deaths for all US states and territories. I hypothesize that in the early months of the pandemic most states had a few hundred cases, and some had around 1,000-5,000 cases. At the beginning of the pandemic, there was a lot of hysteria and many unknown factors surrounding the virus. This could have caused people to follow the stay-at-home orders more closely, so the number of cases should be at its lowest. I predict a spike in COVID cases at the beginning of June 2020. During that time, many states were prematurely loosening COVID restrictions. Months later, many states reinstated the restrictions.
In the later months, particularly around the holiday season, I expect the number of COVID cases to spike. Despite CDC warnings, many Americans were still traveling, which would lead to more people spreading the virus. People were also tired of following the stay-at-home regulations, and in general people became more complacent about the pandemic, which could have led to an increase in cases. Also, testing became more common and more efficient. COVID tests in spring 2020 could take up to 14 days to give results, and now tests can give results within the same day. Faster results lead to more people getting tested, and more people getting tested leads to more positive test results. The graphs may show a drop in COVID cases and deaths after March 2021. In January 2021, vaccines became available for at-risk groups and essential workers, and by March COVID-19 vaccines were made available to everyone. So, I predict that after March 2021 the number of new cases per day will decrease or flatline because more people are vaccinated.
#Graphs for covid cases over time
for df in df_list:
    plt.figure(figsize=(16,9))
    ax = sns.lineplot(data=df.sort_values('date'), x="date", y="cases", hue="state", linewidth=3)
    plt.xticks(rotation='vertical')
    # Dates are plotted as strings, so keep only every 17th tick visible
    xticks = ax.xaxis.get_major_ticks()
    for i in range(len(xticks)):
        if i % 17 != 0:
            xticks[i].set_visible(False)
    plt.ylabel('Cases x 1,000,000')
    plt.title('Cases over Time')
    plt.show()
RESULTS FOR CASES OVER TIME:
The six "Cases over Time" graphs show a timeline of the number of reported COVID-19 cases for each state since the beginning of the pandemic.
In the "Cases over Time" graphs, most of the lines followed our predictions. We predicted that the number of cases would be lowest at the beginning of the pandemic, with each state reporting 0 to 5,000 COVID cases. Many factors could have caused the low early case numbers, from inefficient testing to Americans taking the stay-at-home regulations and social distancing more seriously. Most states have under 10,000 COVID cases until June 2020. Some outlier states do not follow this early trend: New York had about 150,000 cases by the end of March 2020 and around 400,000 by May 2020, and New Jersey had approximately 150,000 cases by April 2020. These states could genuinely have had more COVID cases than the other states, or they could have had better testing facilities. Another factor is that New York and New Jersey have big cities that many people commute to and from, which could have led to more cases.
Surprising information from the graphs is that Alaska, Guam, the Virgin Islands, Hawaii, and the Northern Mariana Islands each have under 100,000 COVID cases. All of these locations are separate from the mainland United States, so it is likely that at the beginning of the pandemic these regions sharply restricted travel to prevent the spread of the virus. Another explanation could be that several of these regions require tourists and anyone leaving or entering to take COVID-19 tests and to quarantine while they await the results. This would likely deter tourism and help prevent the virus from entering the region at all.
Our prediction that COVID cases would spike in June 2020 was correct. Many states reported more COVID cases starting in late May or June. A likely explanation is that state governments across the country lifted many COVID restrictions in June 2020. In July 2020, many states reinstated the restrictions because COVID was still prevalent.
Finally, many states had a major increase in COVID cases around December 2020. We predicted that this would happen because of the number of Americans traveling around the holiday season. Almost all of the states had a major increase in COVID-19 cases from November 2020 to January 2021, with some of the most notable increases in California, New York, Florida, and Texas. Each of those states surpassed 1,000,000 cases by the beginning of 2021, with Texas and California surpassing 1,500,000 cases.
PREDICTIONS FOR COVID DEATHS OVER TIME:
I expect the COVID-19 deaths to closely follow the graphs of coronavirus cases. Similarly, I expect the states with the most COVID cases to have the most COVID-related deaths. The states with the most COVID cases naturally have the potential to have the most COVID deaths, and I would be very surprised if a state with low COVID case numbers had a high COVID death toll.
Much like the Cases over Time graphs, I expect each state to have under 5,000 total COVID deaths until May 2020. It is also likely that the states with high numbers of COVID cases in March and April also have many COVID-related deaths. I also expect a spike in COVID deaths around June 2020 because of states lifting COVID restrictions, and spikes from November 2020 to January 2021 because of the holiday season.
#GRAPHS for covid deaths over time
for df in df_list:
    plt.figure(figsize=(16,9))
    ax = sns.lineplot(data=df.sort_values('date'), x="date", y="deaths", hue="state", linewidth=3)
    plt.xticks(rotation='vertical')
    # Dates are plotted as strings, so keep only every 17th tick visible
    xticks = ax.xaxis.get_major_ticks()
    for i in range(len(xticks)):
        if i % 17 != 0:
            xticks[i].set_visible(False)
    plt.ylabel('Deaths')
    plt.title('Deaths over Time')
    plt.show()
Many states' first COVID deaths came around the end of March and the beginning of April 2020. Unsurprisingly, by the end of May 2020 almost 30,000 people in New York had died because of COVID. By the end of May, Pennsylvania, Massachusetts, and Illinois had all surpassed 5,000 COVID deaths. In New Jersey, there had been over 15,000 COVID deaths by June 25th. In May 2020 California had over 5,000 deaths and, surprisingly, Florida and Texas each had under 5,000 deaths until August 2020. This is surprising because in the Cases over Time graphs those two states had some of the highest numbers of cases.
Regions like Hawaii, Alaska, Guam, the US Virgin Islands, and the Northern Mariana Islands all had under 1,000 COVID deaths. This is unsurprising because those regions also had relatively few cases.
In November 2020 most states' COVID death tolls began to spike. A likely reason for the increase is the holiday season and Americans traveling. Finally, the states with the most COVID deaths are Pennsylvania with over 25,000 deaths, New Jersey with over 25,000 deaths, Florida with over 35,000 deaths, Texas with over 50,000 deaths, New York with over 50,000 deaths, and California with over 60,000 deaths. These states have large metropolitan areas and large populations, so it is no surprise that the most COVID deaths happened in these states. This list is also similar to the list of states with the largest number of COVID cases. Later in the project, we determine if there is a correlation between a state's population and its reported COVID cases and deaths.
To get a better grasp on the income data, we created a circular (polar) bar graph that shows the median income of every state in ascending order.
increasing = incomes.sort_values(by=['Median Income'])
values = increasing['Median Income']
labels = increasing['State']
# One bar per state, spread evenly around the circle
theta = np.arange(0, 2*np.pi, 2*np.pi/len(values))
width = (2*np.pi)/len(values) * 0.9
bottom = 50
fig = plt.figure(figsize=(8,8))
ax = fig.add_axes([0.1, 0.1, 0.75, 0.75], polar=True)
bars = ax.bar(theta, np.array(values), width=width, bottom=bottom)
plt.axis('off')
rotations = np.rad2deg(theta)
# Label each bar with its state and median income, rotated to follow the bar
for x, bar, rotation, label, income in zip(theta, bars, rotations, labels, values):
    ax.text(x, bottom + bar.get_height() + 100, label + " $" + str(income), ha='left', va='center', rotation=rotation, rotation_mode="anchor")
plt.title('Median Income per State', size=20)
plt.show()
Using information from the Pew Research Center, we went through the wealth distribution data and made an Excel sheet with the distribution percentages by state. We then put the Excel sheet in Google Drive and loaded it into the Colab notebook. Below is the table of the state distributions and also the national distribution. To better visualize the data, we created pie charts for a few example states. This data will be used later to further analyze the contributing factors.
url1 ='https://drive.google.com/file/d/1SqqIEKXdUVKJsQmNxOn4K9agTvv1RHBP/view?usp=sharing'
url2='https://drive.google.com/uc?id=' + url1.split('/')[-2]
wealth_distributions = pd.read_excel(url2)
pie_chart = wealth_distributions.set_index('State')
# One pie chart per example state
for state in ['District of Columbia', 'Maryland', 'Alabama', 'Texas']:
    ax = pie_chart.loc[state].plot.pie(title=state + " Wealth Distribution", autopct='%1.1f%%', figsize=(5, 5))
    ax.set_ylabel(' ')
    plt.show()
Pie Charts Showing Wealth Distribution:
According to the “Median Income per State” bar graph, Maryland has the highest median income. So, it is not surprising that 52% of Marylanders are middle class. We hypothesize that Maryland will have low COVID-19 cases because of its high median income. The District of Columbia has 26% of its residents in the lower class, while Maryland has 20.8%. We hypothesize that Maryland will have fewer COVID cases than DC because of the difference in median income.
density = pop_density.copy()
income_weath_distribution_popDensity = pd.concat([incomes.set_index('State'),wealth_distributions.set_index('State'),density.set_index('State')],axis=1)
income_weath_distribution_popDensity
| State | Median Income | Upper Class % | Middle Class % | Lower Class % | POPESTIMATE2019 | area (sq. mi) | Population Density |
|---|---|---|---|---|---|---|---|
| Alabama | 56200 | 17.0 | 52.0 | 31.0 | 4903185 | 52423 | 93.5312 |
| Alaska | 78394 | 19.0 | 58.0 | 23.0 | 731545 | 656425 | 1.11444 |
| Arizona | 70674 | 17.0 | 54.0 | 30.0 | 7278717 | 114006 | 63.845 |
| Arkansas | 54539 | 15.0 | 52.0 | 32.0 | 3017804 | 53182 | 56.7448 |
| California | 78105 | 19.0 | 49.0 | 31.0 | 39512223 | 163707 | 241.359 |
| Colorado | 72499 | 22.0 | 55.0 | 23.0 | 5758736 | 104100 | 55.3193 |
| Connecticut | 87291 | 27.0 | 50.0 | 24.0 | 3565287 | 5544 | 643.089 |
| Delaware | 74194 | 19.0 | 56.0 | 25.0 | 973764 | 1954 | 498.344 |
| District of Columbia | 93111 | 36.0 | 38.0 | 26.0 | 705749 | 68 | 10378.7 |
| Florida | 58368 | 15.0 | 53.0 | 29.0 | 21477737 | 65758 | 326.618 |
| Georgia | 56628 | 20.0 | 51.0 | 29.0 | 10617423 | 59441 | 178.621 |
| Hawaii | 88006 | 18.0 | 58.0 | 24.0 | 1415872 | 10932 | 129.516 |
| Idaho | 65988 | 15.0 | 56.0 | 29.0 | 1787065 | 83574 | 21.383 |
| Illinois | 74399 | 22.0 | 52.0 | 26.0 | 12671821 | 57918 | 218.789 |
| Indiana | 66693 | 18.0 | 57.0 | 26.0 | 6732219 | 36420 | 184.85 |
| Iowa | 66054 | 19.0 | 58.0 | 22.0 | 3155070 | 56276 | 56.0642 |
| Kansas | 73151 | 20.0 | 55.0 | 25.0 | 2913314 | 82282 | 35.4065 |
| Kentucky | 55662 | 16.0 | 53.0 | 31.0 | 4467673 | 40411 | 110.556 |
| Louisiana | 51707 | 18.0 | 48.0 | 34.0 | 4648794 | 51843 | 89.6706 |
| Maine | 66546 | 15.0 | 56.0 | 29.0 | 1344212 | 35387 | 37.986 |
| Maryland | 95572 | 27.0 | 53.0 | 21.0 | 6045680 | 12407 | 487.28 |
| Massachusetts | 87707 | 26.0 | 51.0 | 23.0 | 6892503 | 10555 | 653.008 |
| Michigan | 64119 | 19.0 | 54.0 | 27.0 | 9986857 | 96810 | 103.159 |
| Minnesota | 81426 | 23.0 | 56.0 | 22.0 | 5639632 | 86943 | 64.8659 |
| Mississippi | 44787 | 14.0 | 51.0 | 36.0 | 2976149 | 48434 | 61.4475 |
| Missouri | 60597 | 19.0 | 55.0 | 26.0 | 6137428 | 69709 | 88.0436 |
| Montana | 60195 | 16.0 | 57.0 | 27.0 | 1068778 | 147046 | 7.26832 |
| Nebraska | 73071 | 19.0 | 58.0 | 24.0 | 1934408 | 77358 | 25.0059 |
| Nevada | 70906 | 17.0 | 56.0 | 28.0 | 3080156 | 110567 | 27.8578 |
| New Hampshire | 86900 | 23.0 | 57.0 | 21.0 | 1359711 | 9351 | 145.408 |
| New Jersey | 87726 | 24.0 | 51.0 | 24.0 | 8882190 | 8722 | 1018.37 |
| New Mexico | 53113 | 15.0 | 48.0 | 37.0 | 2096829 | 121593 | 17.2447 |
| New York | 71855 | 19.0 | 49.0 | 32.0 | 19453561 | 54475 | 357.11 |
| North Carolina | 61159 | 18.0 | 53.0 | 29.0 | 10488084 | 53821 | 194.87 |
| North Dakota | 70031 | 24.0 | 53.0 | 21.0 | 762062 | 70704 | 10.7782 |
| Ohio | 64663 | 20.0 | 54.0 | 25.0 | 11689100 | 44828 | 260.754 |
| Oklahoma | 59397 | 17.0 | 54.0 | 20.0 | 3956971 | 69903 | 56.6066 |
| Oregon | 74413 | 18.0 | 55.0 | 27.0 | 4217737 | 98386 | 42.8693 |
| Pennsylvania | 70582 | 19.0 | 54.0 | 27.0 | 12801989 | 46058 | 277.954 |
| Rhode Island | 70151 | 22.0 | 53.0 | 25.0 | 1059361 | 1545 | 685.671 |
| South Carolina | 62028 | 16.0 | 53.0 | 31.0 | 5148714 | 32007 | 160.862 |
| South Dakota | 64255 | 17.0 | 57.0 | 26.0 | 884659 | 77121 | 11.4711 |
| Tennessee | 56627 | 17.0 | 54.0 | 30.0 | 6829174 | 42146 | 162.036 |
| Texas | 67444 | 18.0 | 53.0 | 29.0 | 28995881 | 268601 | 107.952 |
| Utah | 84523 | 17.0 | 61.0 | 22.0 | 3205958 | 84904 | 37.7598 |
| Vermont | 74305 | 15.0 | 58.0 | 27.0 | 623989 | 9615 | 64.8975 |
| Virginia | 81313 | 25.0 | 51.0 | 24.0 | 8535519 | 42769 | 199.573 |
| Washington | 82454 | 22.0 | 54.0 | 24.0 | 7614893 | 71303 | 106.796 |
| West Virginia | 53706 | 14.0 | 52.0 | 34.0 | 1792147 | 24231 | 73.9609 |
| Wisconsin | 67355 | 20.0 | 58.0 | 23.0 | 5822434 | 65503 | 88.8881 |
| Wyoming | 65134 | 19.0 | 57.0 | 24.0 | 578759 | 97818 | 5.91669 |
| Puerto Rico | NaN | NaN | NaN | NaN | 3193694 | 3515 | 908.59 |
Here we go through the table of states, counties, cases, and deaths and use the groupby and max functions to get the largest number of cases and deaths per county (the total cases and deaths). We then use the groupby and sum functions to add up the county totals for each state. Below is the table of totals for all states in America.
total_cases_and_deaths = all_counties.copy()
total_cases_and_deaths = total_cases_and_deaths.drop(columns=['date','fips'])
total_cases_and_deaths = total_cases_and_deaths.rename(columns={'state': 'State'})
total_cases_and_deaths = total_cases_and_deaths.set_index('State')
total_cases_and_deaths = total_cases_and_deaths.groupby(['State','county'])[['cases','deaths']].max()
total_cases_and_deaths = total_cases_and_deaths.groupby(['State']).sum()
total_cases_and_deaths = total_cases_and_deaths.drop(['Northern Mariana Islands','Virgin Islands','Guam'])
total_cases_and_deaths
| State | cases | deaths |
|---|---|---|
| Alabama | 540097 | 11045.0 |
| Alaska | 69042 | 338.0 |
| Arizona | 873027 | 17479.0 |
| Arkansas | 339357 | 5820.0 |
| California | 3768863 | 62710.0 |
| Colorado | 534969 | 6601.0 |
| Connecticut | 344977 | 8184.0 |
| Delaware | 107279 | 1652.0 |
| District of Columbia | 48545 | 1120.0 |
| Florida | 2292440 | 36132.0 |
| Georgia | 1092260 | 19895.0 |
| Hawaii | 34200 | 491.0 |
| Idaho | 190455 | 2074.0 |
| Illinois | 1382159 | 24797.0 |
| Indiana | 738792 | 13481.0 |
| Iowa | 372225 | 6012.0 |
| Kansas | 313867 | 6203.0 |
| Kentucky | 455470 | 6840.0 |
| Louisiana | 471420 | 10713.0 |
| Maine | 65743 | 804.0 |
| Maryland | 456230 | 9011.0 |
| Massachusetts | 702039 | 17848.0 |
| Michigan | 984297 | 19812.0 |
| Minnesota | 594732 | 7390.0 |
| Mississippi | 315182 | 7260.0 |
| Missouri | 610665 | 9401.0 |
| Montana | 110788 | 1603.0 |
| Nebraska | 222496 | 2423.0 |
| Nevada | 320808 | 5543.0 |
| New Hampshire | 98244 | 1337.0 |
| New Jersey | 1016406 | 26054.0 |
| New Mexico | 200863 | 4130.0 |
| New York | 2083649 | 53588.0 |
| North Carolina | 997897 | 12891.0 |
| North Dakota | 111452 | 1548.0 |
| Ohio | 1092195 | 19719.0 |
| Oklahoma | 454183 | 6884.0 |
| Oregon | 195774 | 2604.0 |
| Pennsylvania | 1190595 | 26873.0 |
| Puerto Rico | 170271 | 2435.0 |
| Rhode Island | 154869 | 2895.0 |
| South Carolina | 588593 | 9661.0 |
| South Dakota | 123707 | 1999.0 |
| Tennessee | 847177 | 12295.0 |
| Texas | 2934054 | 52075.0 |
| Utah | 403798 | 2260.0 |
| Vermont | 23887 | 268.0 |
| Virginia | 670359 | 11145.0 |
| Washington | 425404 | 5685.0 |
| West Virginia | 158484 | 2858.0 |
| Wisconsin | 674105 | 7733.0 |
| Wyoming | 59110 | 712.0 |
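Because the county rows are cumulative, the latest total for a county is also its largest value, which is why taking the maximum per county and then summing per state gives the state total. Here is a small sketch of that two-step aggregation on made-up rows (the state and county names are placeholders):

```python
import pandas as pd

# Made-up cumulative rows: two dates for each county.
toy = pd.DataFrame({
    "State":  ["X", "X", "X", "X", "Y", "Y"],
    "county": ["a", "a", "b", "b", "c", "c"],
    "cases":  [5, 12, 3, 7, 1, 4],
    "deaths": [0, 1, 0, 2, 0, 0],
})

# Step 1: the largest cumulative value per county is that county's total.
county_totals = toy.groupby(["State", "county"])[["cases", "deaths"]].max()
# Step 2: add the county totals together within each state.
state_totals = county_totals.groupby("State").sum()
print(state_totals)
```

If a source ever revises a county's count downward, the maximum would overstate the latest figure; taking the most recent row per county by date is an alternative in that case.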
Here is the combined table of all the information so far. We also included the number of COVID cases and deaths per 1,000 people, because larger populations would naturally have more cases and deaths.
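As a check on the normalization, the per-1,000 rate is the count divided by the population, multiplied by 1,000. Here is a short sketch using Alabama's figures from the tables above (540,097 cases, 11,045 deaths, population 4,903,185); the helper name `per_thousand` is hypothetical.

```python
def per_thousand(count, population):
    """Rate per 1,000 residents."""
    return count / population * 1000

alabama_cases = per_thousand(540_097, 4_903_185)
alabama_deaths = per_thousand(11_045, 4_903_185)
print(round(alabama_cases, 3), round(alabama_deaths, 5))
```

Dividing by population lets a small state and a large state be compared on the same scale.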
income_weath_distribution_popDensity_covid = pd.concat([income_weath_distribution_popDensity,total_cases_and_deaths],axis=1)
wealth_data = income_weath_distribution_popDensity_covid.copy()
income_weath_distribution_popDensity_covid = income_weath_distribution_popDensity_covid.drop(['Puerto Rico'])
#swapping state names to abbreviations to make graphs cleaner
abbrev = pd.read_csv('https://raw.githubusercontent.com/jasonong/List-of-US-States/master/states.csv')
income_weath_distribution_popDensity_covid = pd.concat([income_weath_distribution_popDensity_covid,abbrev.set_index('State')],axis=1)
income_weath_distribution_popDensity_covid = income_weath_distribution_popDensity_covid.rename_axis('State')
# Normalize by population so that large and small states can be compared
income_weath_distribution_popDensity_covid['Cases per 1k Population'] = income_weath_distribution_popDensity_covid['cases'] / income_weath_distribution_popDensity_covid['POPESTIMATE2019'] * 1000
income_weath_distribution_popDensity_covid['Deaths per 1k Population'] = income_weath_distribution_popDensity_covid['deaths'] / income_weath_distribution_popDensity_covid['POPESTIMATE2019'] * 1000
income_weath_distribution_popDensity_covid
| State | Median Income | Upper Class % | Middle Class % | Lower Class % | POPESTIMATE2019 | area (sq. mi) | Population Density | cases | deaths | Abbreviation | Cases per 1k Population | Deaths per 1k Population |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Alabama | 56200 | 17.0 | 52.0 | 31.0 | 4903185 | 52423 | 93.5312 | 540097 | 11045.0 | AL | 110.152 | 2.25262 |
| Alaska | 78394 | 19.0 | 58.0 | 23.0 | 731545 | 656425 | 1.11444 | 69042 | 338.0 | AK | 94.3783 | 0.462036 |
| Arizona | 70674 | 17.0 | 54.0 | 30.0 | 7278717 | 114006 | 63.845 | 873027 | 17479.0 | AZ | 119.942 | 2.40138 |
| Arkansas | 54539 | 15.0 | 52.0 | 32.0 | 3017804 | 53182 | 56.7448 | 339357 | 5820.0 | AR | 112.452 | 1.92855 |
| California | 78105 | 19.0 | 49.0 | 31.0 | 39512223 | 163707 | 241.359 | 3768863 | 62710.0 | CA | 95.3847 | 1.5871 |
| Colorado | 72499 | 22.0 | 55.0 | 23.0 | 5758736 | 104100 | 55.3193 | 534969 | 6601.0 | CO | 92.8969 | 1.14626 |
| Connecticut | 87291 | 27.0 | 50.0 | 24.0 | 3565287 | 5544 | 643.089 | 344977 | 8184.0 | CT | 96.76 | 2.29547 |
| Delaware | 74194 | 19.0 | 56.0 | 25.0 | 973764 | 1954 | 498.344 | 107279 | 1652.0 | DE | 110.169 | 1.69651 |
| District of Columbia | 93111 | 36.0 | 38.0 | 26.0 | 705749 | 68 | 10378.7 | 48545 | 1120.0 | DC | 68.7851 | 1.58697 |
| Florida | 58368 | 15.0 | 53.0 | 29.0 | 21477737 | 65758 | 326.618 | 2292440 | 36132.0 | FL | 106.736 | 1.6823 |
| Georgia | 56628 | 20.0 | 51.0 | 29.0 | 10617423 | 59441 | 178.621 | 1092260 | 19895.0 | GA | 102.874 | 1.87381 |
| Hawaii | 88006 | 18.0 | 58.0 | 24.0 | 1415872 | 10932 | 129.516 | 34200 | 491.0 | HI | 24.1547 | 0.346783 |
| Idaho | 65988 | 15.0 | 56.0 | 29.0 | 1787065 | 83574 | 21.383 | 190455 | 2074.0 | ID | 106.574 | 1.16056 |
| Illinois | 74399 | 22.0 | 52.0 | 26.0 | 12671821 | 57918 | 218.789 | 1382159 | 24797.0 | IL | 109.073 | 1.95686 |
| Indiana | 66693 | 18.0 | 57.0 | 26.0 | 6732219 | 36420 | 184.85 | 738792 | 13481.0 | IN | 109.74 | 2.00246 |
| Iowa | 66054 | 19.0 | 58.0 | 22.0 | 3155070 | 56276 | 56.0642 | 372225 | 6012.0 | IA | 117.977 | 1.9055 |
| Kansas | 73151 | 20.0 | 55.0 | 25.0 | 2913314 | 82282 | 35.4065 | 313867 | 6203.0 | KS | 107.735 | 2.12919 |
| Kentucky | 55662 | 16.0 | 53.0 | 31.0 | 4467673 | 40411 | 110.556 | 455470 | 6840.0 | KY | 101.948 | 1.531 |
| Louisiana | 51707 | 18.0 | 48.0 | 34.0 | 4648794 | 51843 | 89.6706 | 471420 | 10713.0 | LA | 101.407 | 2.30447 |
| Maine | 66546 | 15.0 | 56.0 | 29.0 | 1344212 | 35387 | 37.986 | 65743 | 804.0 | ME | 48.9082 | 0.59812 |
| Maryland | 95572 | 27.0 | 53.0 | 21.0 | 6045680 | 12407 | 487.28 | 456230 | 9011.0 | MD | 75.4638 | 1.49049 |
| Massachusetts | 87707 | 26.0 | 51.0 | 23.0 | 6892503 | 10555 | 653.008 | 702039 | 17848.0 | MA | 101.855 | 2.58948 |
| Michigan | 64119 | 19.0 | 54.0 | 27.0 | 9986857 | 96810 | 103.159 | 984297 | 19812.0 | MI | 98.5592 | 1.98381 |
| Minnesota | 81426 | 23.0 | 56.0 | 22.0 | 5639632 | 86943 | 64.8659 | 594732 | 7390.0 | MN | 105.456 | 1.31037 |
| Mississippi | 44787 | 14.0 | 51.0 | 36.0 | 2976149 | 48434 | 61.4475 | 315182 | 7260.0 | MS | 105.903 | 2.43939 |
| Missouri | 60597 | 19.0 | 55.0 | 26.0 | 6137428 | 69709 | 88.0436 | 610665 | 9401.0 | MO | 99.4985 | 1.53175 |
| Montana | 60195 | 16.0 | 57.0 | 27.0 | 1068778 | 147046 | 7.26832 | 110788 | 1603.0 | MT | 103.659 | 1.49984 |
| Nebraska | 73071 | 19.0 | 58.0 | 24.0 | 1934408 | 77358 | 25.0059 | 222496 | 2423.0 | NE | 115.02 | 1.25258 |
| Nevada | 70906 | 17.0 | 56.0 | 28.0 | 3080156 | 110567 | 27.8578 | 320808 | 5543.0 | NV | 104.153 | 1.79958 |
| New Hampshire | 86900 | 23.0 | 57.0 | 21.0 | 1359711 | 9351 | 145.408 | 98244 | 1337.0 | NH | 72.2536 | 0.983297 |
| New Jersey | 87726 | 24.0 | 51.0 | 24.0 | 8882190 | 8722 | 1018.37 | 1016406 | 26054.0 | NJ | 114.432 | 2.93329 |
| New Mexico | 53113 | 15.0 | 48.0 | 37.0 | 2096829 | 121593 | 17.2447 | 200863 | 4130.0 | NM | 95.7937 | 1.96964 |
| New York | 71855 | 19.0 | 49.0 | 32.0 | 19453561 | 54475 | 357.11 | 2083649 | 53588.0 | NY | 107.109 | 2.75466 |
| North Carolina | 61159 | 18.0 | 53.0 | 29.0 | 10488084 | 53821 | 194.87 | 997897 | 12891.0 | NC | 95.1458 | 1.22911 |
| North Dakota | 70031 | 24.0 | 53.0 | 21.0 | 762062 | 70704 | 10.7782 | 111452 | 1548.0 | ND | 146.251 | 2.03133 |
| Ohio | 64663 | 20.0 | 54.0 | 25.0 | 11689100 | 44828 | 260.754 | 1092195 | 19719.0 | OH | 93.437 | 1.68696 |
| Oklahoma | 59397 | 17.0 | 54.0 | 20.0 | 3956971 | 69903 | 56.6066 | 454183 | 6884.0 | OK | 114.78 | 1.73971 |
| Oregon | 74413 | 18.0 | 55.0 | 27.0 | 4217737 | 98386 | 42.8693 | 195774 | 2604.0 | OR | 46.4168 | 0.617393 |
| Pennsylvania | 70582 | 19.0 | 54.0 | 27.0 | 12801989 | 46058 | 277.954 | 1190595 | 26873.0 | PA | 93.0008 | 2.09913 |
| Rhode Island | 70151 | 22.0 | 53.0 | 25.0 | 1059361 | 3515 | 301.383 | 154869 | 2895.0 | RI | 146.191 | 2.73278 |
| South Carolina | 62028 | 16.0 | 53.0 | 31.0 | 5148714 | 1545 | 3332.5 | 588593 | 9661.0 | SC | 114.318 | 1.87639 |
| South Dakota | 64255 | 17.0 | 57.0 | 26.0 | 884659 | 32007 | 27.6395 | 123707 | 1999.0 | SD | 139.836 | 2.25963 |
| Tennessee | 56627 | 17.0 | 54.0 | 30.0 | 6829174 | 77121 | 88.5514 | 847177 | 12295.0 | TN | 124.053 | 1.80036 |
| Texas | 67444 | 18.0 | 53.0 | 29.0 | 28995881 | 42146 | 687.987 | 2934054 | 52075.0 | TX | 101.189 | 1.79594 |
| Utah | 84523 | 17.0 | 61.0 | 22.0 | 3205958 | 268601 | 11.9358 | 403798 | 2260.0 | UT | 125.952 | 0.704937 |
| Vermont | 74305 | 15.0 | 58.0 | 27.0 | 623989 | 84904 | 7.34935 | 23887 | 268.0 | VT | 38.2811 | 0.429495 |
| Virginia | 81313 | 25.0 | 51.0 | 24.0 | 8535519 | 9615 | 887.729 | 670359 | 11145.0 | VA | 78.5376 | 1.30572 |
| Washington | 82454 | 22.0 | 54.0 | 24.0 | 7614893 | 42769 | 178.047 | 425404 | 5685.0 | WA | 55.8647 | 0.746563 |
| West Virginia | 53706 | 14.0 | 52.0 | 34.0 | 1792147 | 71303 | 25.1342 | 158484 | 2858.0 | WV | 88.4325 | 1.59474 |
| Wisconsin | 67355 | 20.0 | 58.0 | 23.0 | 5822434 | 24231 | 240.289 | 674105 | 7733.0 | WI | 115.777 | 1.32814 |
| Wyoming | 65134 | 19.0 | 57.0 | 24.0 | 578759 | 65503 | 8.83561 | 59110 | 712.0 | WY | 102.132 | 1.23022 |
Below, we go through each state and estimate its average daily increase in COVID-19 cases and deaths per 1,000 people, using the slope of a linear regression of the cumulative per-1,000 counts against the day index.
# Collect the state names from every DataFrame in df_list.
states = []
for df in df_list:
    states.append(df['state'].tolist())

# Build a de-duplicated list of states, preserving first-seen order.
# (Named all_states to avoid shadowing the built-in all().)
all_states = []
for state in states:
    for s in state:
        if s not in all_states:
            all_states.append(s)

# Territories that are excluded from the analysis.
territories = ['Guam', 'Northern Mariana Islands', 'Puerto Rico', 'Virgin Islands']

i = 0
cases_daily_increase = {}
death_daily_increase = {}
for df in df_list:
    x = df.groupby('state')
    for y in range(len(x)):
        state = all_states[i]
        if state not in territories:
            data = x.get_group(state).reset_index(drop=True)
            # Regress the cumulative count per 1k people on the day index; the slope is the average daily increase.
            cases_reg = LinearRegression().fit(np.array(data.index).reshape(-1, 1),((data['cases'].values/income_weath_distribution_popDensity_covid['POPESTIMATE2019'][state])*1000).reshape(-1, 1))
            death_reg = LinearRegression().fit(np.array(data.index).reshape(-1, 1),((data['deaths'].values/income_weath_distribution_popDensity_covid['POPESTIMATE2019'][state])*1000).reshape(-1, 1))
            case_m = cases_reg.coef_[0]
            death_m = death_reg.coef_[0]
            cases_daily_increase[state] = case_m.item(0)
            death_daily_increase[state] = death_m.item(0)
        i += 1

# Sort both dictionaries alphabetically by state name.
cases_daily_increase = dict(sorted(cases_daily_increase.items(), key=lambda x: x[0].lower()))
death_daily_increase = dict(sorted(death_daily_increase.items(), key=lambda x: x[0].lower()))
cases_daily_increase
{'Alabama': 0.3124920454702609,
'Alaska': 0.2676809301612137,
'Arizona': 0.30314023869979567,
'Arkansas': 0.3281796645648714,
'California': 0.2425635286981636,
'Colorado': 0.24091635865109248,
'Connecticut': 0.23730953354809964,
'Delaware': 0.27619750857482667,
'District of Columbia': 0.1649844529030956,
'Florida': 0.2741698264928856,
'Georgia': 0.2767007827423677,
'Hawaii': 0.06419272981197092,
'Idaho': 0.3133624045671537,
'Illinois': 0.26630740385996765,
'Indiana': 0.30801127780959037,
'Iowa': 0.3389548821337311,
'Kansas': 0.3167168756915987,
'Kentucky': 0.2845283191520578,
'Louisiana': 0.26927497513852033,
'Maine': 0.11127900734762568,
'Maryland': 0.19007771748279303,
'Massachusetts': 0.22996238387426013,
'Michigan': 0.22772589825497425,
'Minnesota': 0.28467039546800676,
'Mississippi': 0.30098205388158283,
'Missouri': 0.28460360747055297,
'Montana': 0.3121358578331966,
'Nebraska': 0.3148330069432048,
'Nevada': 0.29426932796621125,
'New Hampshire': 0.17364694858260607,
'New Jersey': 0.2607499034023974,
'New Mexico': 0.2776795305359753,
'New York': 0.23528793515953594,
'North Carolina': 0.25106517447365545,
'North Dakota': 0.4388341970099949,
'Ohio': 0.26085805377878696,
'Oklahoma': 0.33100223140824564,
'Oregon': 0.11637067523983813,
'Pennsylvania': 0.23113871192373245,
'Rhode Island': 0.3602013102851736,
'South Carolina': 0.30416314692303636,
'South Dakota': 0.41764255143513995,
'Tennessee': 0.3468167199370685,
'Texas': 0.26495267845578274,
'Utah': 0.35216485978661743,
'Vermont': 0.0833138676735359,
'Virginia': 0.2041156314938369,
'Washington': 0.12675483720209132,
'West Virginia': 0.24344779070995007,
'Wisconsin': 0.31571831708730796,
'Wyoming': 0.3067765599028716}
death_daily_increase
{'Alabama': 0.005912638942792392,
'Alaska': 0.0012451299656948106,
'Arizona': 0.005820531980316021,
'Arkansas': 0.005554833323672776,
'California': 0.0035722192215602096,
'Colorado': 0.002856624972635096,
'Connecticut': 0.0047162194341092084,
'Delaware': 0.0038853264195757025,
'District of Columbia': 0.0033546404899030048,
'Florida': 0.004331791547686282,
'Georgia': 0.004635844629553969,
'Hawaii': 0.0009596399664922706,
'Idaho': 0.0033222452989178936,
'Illinois': 0.004670989933651811,
'Indiana': 0.0051268221132129465,
'Iowa': 0.00518425324378343,
'Kansas': 0.004793461068119852,
'Kentucky': 0.0035756469806986248,
'Louisiana': 0.005358304842061762,
'Maine': 0.0014790624531876742,
'Maryland': 0.003368846944317173,
'Massachusetts': 0.005743326086691088,
'Michigan': 0.004290464561932055,
'Minnesota': 0.003443443504187841,
'Mississippi': 0.006625002631840353,
'Missouri': 0.00411554662565436,
'Montana': 0.004223359963631493,
'Nebraska': 0.003298638090882738,
'Nevada': 0.00472570036794934,
'New Hampshire': 0.0023291337185158997,
'New Jersey': 0.005815719890755664,
'New Mexico': 0.0052911739021684265,
'New York': 0.0047574721580621,
'North Carolina': 0.003171914091938798,
'North Dakota': 0.006229021932519293,
'Ohio': 0.004091240325191668,
'Oklahoma': 0.0038390376800014857,
'Oregon': 0.0015799294603967594,
'Pennsylvania': 0.00506642190150336,
'Rhode Island': 0.006089278354196431,
'South Carolina': 0.004953698813873409,
'South Dakota': 0.006711697540586369,
'Tennessee': 0.004947212066061075,
'Texas': 0.004470115422710877,
'Utah': 0.0018066112238541534,
'Vermont': 0.0008729723612167171,
'Virginia': 0.003026016705703502,
'Washington': 0.0016805532088196389,
'West Virginia': 0.004268481517941363,
'Wisconsin': 0.003387759726774211,
'Wyoming': 0.003597505133695907}
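To make the slope-based estimate concrete, here is a minimal sketch of the same idea on made-up numbers: the "daily increase" is the least-squares slope of a state's cumulative per-1k count against the day index. The `ols_slope` helper, the case counts, and the population are all hypothetical, and plain Python stands in for the notebook's `LinearRegression` so the snippet runs on its own.

```python
def ols_slope(xs, ys):
    """Least-squares slope of ys regressed on xs (single feature, with intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical cumulative case counts for one state over five days (invented numbers).
cumulative_cases = [100, 130, 160, 190, 220]
population = 100_000  # hypothetical population

days = list(range(len(cumulative_cases)))
cases_per_1k = [c / population * 1000 for c in cumulative_cases]

# The slope is the average daily increase in cases per 1k people.
daily_increase = ols_slope(days, cases_per_1k)
print(f"Average daily increase: {daily_increase:.3f} cases per 1k people")
```

Because the invented series grows by 30 cases per day in a population of 100,000, the slope comes out at roughly 0.3 cases per 1k people per day.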
plt.figure(figsize=(25,15))
plt.bar(cases_daily_increase.keys(),cases_daily_increase.values())
plt.xticks(rotation='vertical')
plt.xticks(size = 20)
plt.yticks(size = 20)
plt.xlabel('State', size = 20)
plt.ylabel('Daily Increase of COVID Cases Per 1K people', size = 20)
plt.title('Daily Increase of COVID Cases Per 1K people per State', size = 30)
plt.show()
plt.figure(figsize=(25,15))
plt.bar(death_daily_increase.keys(),death_daily_increase.values())
plt.xticks(rotation='vertical')
plt.xticks(size = 20)
plt.yticks(size = 20)
plt.xlabel('State', size = 20)
plt.ylabel('Daily Increase of COVID Deaths Per 1K people', size = 20)
plt.title('Daily Increase of COVID Deaths Per 1K people per State', size = 30)
plt.show()
PREDICTIONS POPULATION VS TOTAL CASES:
In this scatter plot, we compare each state's total COVID-19 cases with its population. We predict a direct (positive) correlation between the two, because the states with the largest populations have the most people who can contract COVID-19. We expect a few outliers far from the regression line, but most points should lie close to it, with total cases increasing as population increases.
plt.figure(figsize=(35,15))
sns.scatterplot(data=income_weath_distribution_popDensity_covid, x="POPESTIMATE2019", y="cases", hue="State",s=200)
plt.legend(bbox_to_anchor=(1.01, 1),borderaxespad=0)
plt.xlabel('Population', size = 20)
plt.ylabel('Total Cases x 1,000,000', size = 20)
plt.title('Population Vs Total Cases', size = 30)
for x, y, State in zip(income_weath_distribution_popDensity_covid['POPESTIMATE2019'], income_weath_distribution_popDensity_covid['cases'],income_weath_distribution_popDensity_covid['Abbreviation'] ):
    plt.text(x = x, y = y-150, s = State,color = 'black',fontsize=12)
reg = LinearRegression().fit((income_weath_distribution_popDensity_covid['POPESTIMATE2019'].values).reshape(-1, 1), (income_weath_distribution_popDensity_covid['cases'].values).reshape(-1, 1))
x_values = income_weath_distribution_popDensity_covid['POPESTIMATE2019'].values
plt.plot(x_values,x_values*(reg.coef_[0].item())+reg.intercept_)
plt.show()
RESULTS OF POPULATION VS TOTAL CASES:
Our prediction was correct: there is a direct correlation between total cases and state population. California has the largest population, about 40 million, and the largest number of COVID-19 cases, about 3.5 million. Even so, California falls below the regression line, which gives an expected count of about 3.7 million cases for a state with a population of 40 million. One surprising feature of this graph is that there are no drastic outliers; all of the points are close to the regression line.
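One way to put a number on "direct correlation" is Pearson's correlation coefficient. The sketch below is an illustration only, not part of the original analysis: it computes r on six rows copied from the state table earlier in this document, using a hypothetical `pearson_r` helper written in plain Python. Because it uses a subset of states, the value will differ from one computed on the full table.

```python
import math

# (population, total cases) for six states, copied from the state table.
rows = {
    'Idaho': (1787065, 190455),
    'Illinois': (12671821, 1382159),
    'Maine': (1344212, 65743),
    'New York': (19453561, 2083649),
    'Texas': (28995881, 2934054),
    'Vermont': (623989, 23887),
}

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

populations = [pop for pop, _ in rows.values()]
cases = [c for _, c in rows.values()]
r = pearson_r(populations, cases)
print(f"Pearson r on this six-state subset: {r:.3f}")
```

A value of r close to 1 is consistent with the points lying near an upward-sloping line, which is what the scatter plot shows.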
PREDICTION OF TOTAL POPULATION VS TOTAL DEATHS:
We predict that the points will follow the same pattern as in the Population vs Total Cases graph. Earlier in the project, we showed the similarities between the cases-over-time graph and the deaths-over-time graph, so we expect the population vs total deaths graph to resemble the total cases graph as well. Since the Population vs Total Cases graph had no outliers and cases and deaths are so closely related, we predict that this graph will have few or no outliers and that most points will follow the regression line. In short, we predict a direct correlation between a state's population and its total COVID-19 deaths.
plt.figure(figsize=(35,15))
sns.scatterplot(data=income_weath_distribution_popDensity_covid, x="POPESTIMATE2019", y="deaths", hue="State",s=200)
plt.legend(bbox_to_anchor=(1.01, 1),borderaxespad=0)
plt.xlabel('Population', size = 20)
plt.ylabel('Total Deaths', size = 20)
plt.title('Population Vs Total Deaths', size = 30)
for x, y, State in zip(income_weath_distribution_popDensity_covid['POPESTIMATE2019'], income_weath_distribution_popDensity_covid['deaths'],income_weath_distribution_popDensity_covid['Abbreviation'] ):
    plt.text(x = x, y = y-150, s = State,color = 'black',fontsize=12)
reg = LinearRegression().fit((income_weath_distribution_popDensity_covid['POPESTIMATE2019'].values).reshape(-1, 1), (income_weath_distribution_popDensity_covid['deaths'].values).reshape(-1, 1))
x_values = income_weath_distribution_popDensity_covid['POPESTIMATE2019'].values
plt.plot(x_values,x_values*(reg.coef_[0].item())+reg.intercept_)
plt.show()
plt.figure(figsize=(35,15))
x_val = income_weath_distribution_popDensity_covid['Median Income'].values
y_val = death_daily_increase.values()
plt.scatter(x=x_val, y=y_val,s=200)
plt.xlabel('Median Income in Dollars', size = 20)
plt.ylabel('Daily Increase in Deaths', size = 20)
plt.title('Daily Increase in Deaths Vs Median Income', size = 30)
for x, y, State in zip(x_val, y_val,income_weath_distribution_popDensity_covid['Abbreviation']):
    plt.text(x = x, y = y, s = State,color = 'black',fontsize=12)
reg = LinearRegression().fit((x_val).reshape(-1, 1), (np.array(list(y_val))).reshape(-1, 1))
x_values = x_val
plt.plot(x_values,x_values*(reg.coef_[0].item())+reg.intercept_)
plt.show()
RESULTS OF TOTAL POPULATION VS TOTAL DEATHS:
Our general prediction that the two would be correlated was correct, although the points do not follow the regression line as closely as they do in the population vs cases graph. New York is the biggest outlier: it has a population of about 20 million and about 50,000 COVID-19-related deaths, whereas the regression line gives an expected count of about 35,000 deaths. Most points still follow the regression line, so there is a correlation between a state's total population and its total COVID-19 deaths.
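The "expected versus actual" comparison above is a residual: the actual value minus the value on the regression line. The following sketch illustrates the calculation on eight rows copied from the state table. The `fit_line` helper is hypothetical, and because the line is fitted to this subset only, its expected values will not match the figures read off the full 51-point graph.

```python
# (population, total deaths) for eight large states, copied from the state table.
rows = {
    'Illinois': (12671821, 24797),
    'Michigan': (9986857, 19812),
    'New York': (19453561, 53588),
    'North Carolina': (10488084, 12891),
    'Ohio': (11689100, 19719),
    'Pennsylvania': (12801989, 26873),
    'Texas': (28995881, 52075),
    'Washington': (7614893, 5685),
}

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

xs = [pop for pop, _ in rows.values()]
ys = [deaths for _, deaths in rows.values()]
slope, intercept = fit_line(xs, ys)

# Residual = actual deaths minus the deaths expected from the fitted line.
residuals = {
    state: deaths - (slope * pop + intercept)
    for state, (pop, deaths) in rows.items()
}
largest = max(residuals, key=lambda s: abs(residuals[s]))
print(f"Largest residual in this subset: {largest} ({residuals[largest]:+.0f} deaths)")
```

A positive residual means a state recorded more deaths than the line predicts for its population; a negative one means fewer.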
PREDICTION POPULATION DENSITY VS CASES & DEATHS PER 1k POPULATION
We predict that the cases and deaths by population density graphs will be very similar, because the trends in cases and deaths have been similar throughout the project. We predict that the states with the greatest population density will have more cases and deaths. This makes sense because metropolitan areas have more person-to-person interaction, especially now that quarantine restrictions have been lifted and the public is opening up again. So we expect a correlation between population density and the number of COVID-19 cases and deaths.
plt.figure(figsize=(35,15))
sns.scatterplot(data=income_weath_distribution_popDensity_covid, x="Population Density", y="Cases per 1k Population", hue="State",s=200)
plt.legend(bbox_to_anchor=(1.01, 1),borderaxespad=0)
plt.xlabel('Population Density (Total Population/Area)', size = 20)
plt.ylabel('Cases per 1k Population', size = 20)
plt.title('Population Density Vs Cases per 1k Population', size = 30)
for x, y, State in zip(income_weath_distribution_popDensity_covid['Population Density'], income_weath_distribution_popDensity_covid['Cases per 1k Population'],income_weath_distribution_popDensity_covid['Abbreviation'] ):
    plt.text(x = x, y = y-1, s = State,color = 'black',fontsize=12)
reg = LinearRegression().fit((income_weath_distribution_popDensity_covid['Population Density'].values).reshape(-1, 1), (income_weath_distribution_popDensity_covid['Cases per 1k Population'].values).reshape(-1, 1))
x_values = income_weath_distribution_popDensity_covid['Population Density'].values
plt.plot(x_values,x_values*(reg.coef_[0].item())+reg.intercept_)
plt.show()
plt.figure(figsize=(35,15))
sns.scatterplot(data=income_weath_distribution_popDensity_covid, x="Population Density", y="Deaths per 1k Population",hue="State",s=200)
plt.legend(bbox_to_anchor=(1.01, 1),borderaxespad=0)
plt.xlabel('Population Density (Total Population/Area)', size = 20)
plt.ylabel('Deaths per 1k Population', size = 20)
plt.title('Population Density Vs Deaths per 1k Population', size = 30)
for x, y, State in zip(income_weath_distribution_popDensity_covid['Population Density'], income_weath_distribution_popDensity_covid['Deaths per 1k Population'],income_weath_distribution_popDensity_covid['Abbreviation'] ):
    plt.text(x = x, y = y, s = State,color = 'black',fontsize=12)
reg = LinearRegression().fit((income_weath_distribution_popDensity_covid['Population Density'].values).reshape(-1, 1), (income_weath_distribution_popDensity_covid['Deaths per 1k Population'].values).reshape(-1, 1))
x_values = income_weath_distribution_popDensity_covid['Population Density'].values
plt.plot(x_values,x_values*(reg.coef_[0].item())+reg.intercept_)
plt.show()
RESULTS POPULATION DENSITY VS CASES & DEATHS PER 1k POPULATION
The graphs do not show a correlation between population density and COVID-19 cases or deaths. The most densely populated area, DC, is around the median for both cases and deaths compared to the other areas. This comparison would likely have worked better with US cities than with states: Texas, for example, has densely populated areas but also a lot of open land, and a statewide density figure does not show that.
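The point about open land can be illustrated with a small calculation. The sketch below uses an entirely hypothetical state (all populations and areas are invented) made of one dense city and a large rural area, and shows how the statewide average hides the density that half of its residents actually live in.

```python
# Hypothetical state made of one dense city plus a large rural area.
# All numbers are invented for illustration.
city = {'population': 1_000_000, 'area_sq_mi': 500}
rural = {'population': 1_000_000, 'area_sq_mi': 99_500}

def density(population, area_sq_mi):
    """People per square mile."""
    return population / area_sq_mi

city_density = density(**city)
rural_density = density(**rural)
statewide_density = density(
    city['population'] + rural['population'],
    city['area_sq_mi'] + rural['area_sq_mi'],
)

print(f"City:      {city_density:,.1f} people per sq mi")
print(f"Rural:     {rural_density:,.1f} people per sq mi")
print(f"Statewide: {statewide_density:,.1f} people per sq mi")
```

In this invented example, half the population lives at 2,000 people per square mile, yet the statewide figure is only 20, which is why a city-level comparison could be more informative.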
PREDICTION MEDIAN INCOME VS CASES PER 1K POPULATION
We expect there to be a correlation between a state's median income and the number of COVID-19 cases in that state. We determined that population is a major contributing factor to the number of COVID-19 cases, so we would expect states with a larger lower-class population to have more COVID-19 cases. These people are more likely to be essential workers who did not have the privilege of working from home, so they would be exposed to the virus more. States with a higher median income would have less of their population working in essential roles, so there would be fewer cases.
plt.figure(figsize=(35,15))
sns.scatterplot(data=income_weath_distribution_popDensity_covid, x="Median Income", y="Cases per 1k Population", hue="State",s=200)
plt.legend(bbox_to_anchor=(1.01, 1),borderaxespad=0)
plt.xlabel('Median Income in Dollars', size = 20)
plt.ylabel('Cases per 1k Population', size = 20)
plt.title('Median Income Vs Cases per 1k Population', size = 30)
for x, y, State in zip(income_weath_distribution_popDensity_covid['Median Income'], income_weath_distribution_popDensity_covid['Cases per 1k Population'],income_weath_distribution_popDensity_covid['Abbreviation']):
    plt.text(x = x+1, y = y-1, s = State,color = 'black',fontsize=12)
reg = LinearRegression().fit((income_weath_distribution_popDensity_covid['Median Income'].values).reshape(-1, 1), (income_weath_distribution_popDensity_covid['Cases per 1k Population'].values).reshape(-1, 1))
x_values = income_weath_distribution_popDensity_covid['Median Income'].values
plt.plot(x_values,x_values*(reg.coef_[0].item())+reg.intercept_)
plt.show()
RESULTS MEDIAN INCOME VS CASES PER 1K POPULATION
For this graph, we originally looked at the Median Income vs Total Cases, but we realized that the graph was giving meaningless information because the states with the largest population were going to have the largest cases regardless of the median income. To fix this we changed total cases to cases per 1,000 population to normalize the data.
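The normalization described above is a simple rescaling: divide the count by the population and multiply by 1,000. As a worked example, the sketch below applies it to two rows copied from the state table; the `per_1k` helper name is ours, chosen for illustration.

```python
# (population, total cases) copied from the state table.
totals = {
    'Texas': (28995881, 2934054),
    'Vermont': (623989, 23887),
}

def per_1k(count, population):
    """Scale a raw count to a rate per 1,000 people."""
    return count / population * 1000

cases_per_1k = {state: per_1k(cases, pop) for state, (pop, cases) in totals.items()}
for state, rate in cases_per_1k.items():
    print(f"{state}: {rate:.3f} cases per 1k people")

# Compare how far apart the two states are before and after normalizing.
raw_ratio = totals['Texas'][1] / totals['Vermont'][1]
rate_ratio = cases_per_1k['Texas'] / cases_per_1k['Vermont']
print(f"Texas/Vermont ratio: {raw_ratio:.1f}x in total cases, {rate_ratio:.1f}x per 1k people")
```

The results agree with the Cases per 1k Population column of the table. Texas has over a hundred times as many total cases as Vermont, but less than three times as many per 1,000 people, which is why raw totals mostly reflect population.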
Our prediction was correct: there seems to be a correlation between median income and cases per 1,000 people. There are several outliers, but the regression line has a negative slope, which means that as a state's median income increases, the number of reported cases per 1,000 people decreases. A notable outlier is Hawaii (HI), which has about 20 cases per 1,000 people and one of the highest median incomes. Hawaii is isolated from the mainland United States and has strict COVID-19 regulations for tourists traveling there, so it is no wonder that its cases are low.
PREDICTION MEDIAN INCOME VS DEATHS PER 1k POPULATION
We predict that the points will line up almost the same as in the median income vs cases per 1,000 population graph. The correlation between income and deaths might be even stronger, because income can directly affect the quality of treatment that individuals suffering from COVID-19 receive: people in areas with a high median income would receive higher-quality treatment, so there would be fewer deaths in those states. So there should be a correlation between median income and deaths per 1,000 population.
plt.figure(figsize=(35,15))
sns.scatterplot(data=income_weath_distribution_popDensity_covid, x="Median Income", y="Deaths per 1k Population", hue="State",s=200)
plt.legend(bbox_to_anchor=(1.01, 1),borderaxespad=0)
plt.xlabel('Median Income in Dollars', size = 20)
plt.ylabel('Deaths per 1k Population', size = 20)
plt.title('Median Income Vs Deaths per 1k Population', size = 30)
for x, y, State in zip(income_weath_distribution_popDensity_covid['Median Income'], income_weath_distribution_popDensity_covid['Deaths per 1k Population'],income_weath_distribution_popDensity_covid['Abbreviation'] ):
    plt.text(x = x+.01, y = y-.01, s = State,color = 'black',fontsize=12)
reg = LinearRegression().fit((income_weath_distribution_popDensity_covid['Median Income'].values).reshape(-1, 1), (income_weath_distribution_popDensity_covid['Deaths per 1k Population'].values).reshape(-1, 1))
x_values = income_weath_distribution_popDensity_covid['Median Income'].values
plt.plot(x_values,x_values*(reg.coef_[0].item())+reg.intercept_)
plt.show()
plt.figure(figsize=(35,15))
x_val = income_weath_distribution_popDensity_covid['Median Income'].values
y_val = cases_daily_increase.values()
plt.scatter(x=x_val, y=y_val,s=200)
plt.xlabel('Median Income in Dollars', size = 20)
plt.ylabel('Daily Increase in Cases', size = 20)
plt.title('Daily Increase in Cases Vs Median Income', size = 30)
for x, y, State in zip(x_val, y_val,income_weath_distribution_popDensity_covid['Abbreviation']):
    plt.text(x = x, y = y, s = State,color = 'black',fontsize=12)
reg = LinearRegression().fit((x_val).reshape(-1, 1), (np.array(list(y_val))).reshape(-1, 1))
x_values = x_val
plt.plot(x_values,x_values*(reg.coef_[0].item())+reg.intercept_)
plt.show()
RESULTS MEDIAN INCOME VS DEATHS PER 1k POPULATION
This graph resembled the median income vs cases per 1,000 population graph. Median income and deaths per 1,000 population do correlate, and the regression line again has a negative slope. However, many states lie far from the regression line, so their death rates are not what the line would predict.
PREDICTION Wealth Distribution VS Cases PER 1k POPULATION
Here we are analyzing how the distribution of wealth in each state affects COVID-19 cases and deaths. We are doing this because the cost of living varies from state to state, so median income alone might not be conclusive. Our prediction is that states with more upper- and middle-class people will have fewer COVID-19 cases and deaths, since those people would be able to afford better treatment and have access to more resources, and that states with a higher lower-class percentage will have more cases and deaths.
plt.figure(figsize=(35,15))
sns.scatterplot(data=income_weath_distribution_popDensity_covid, x="Upper Class %", y="Cases per 1k Population", hue="State",s=200)
plt.legend(bbox_to_anchor=(1.01, 1),borderaxespad=0)
plt.xlabel('Upper Class %', size = 20)
plt.ylabel('Cases per 1k Population', size = 20)
plt.title('Upper Class % Vs Cases per 1k Population', size = 30)
for x, y, State in zip(income_weath_distribution_popDensity_covid['Upper Class %'], income_weath_distribution_popDensity_covid['Cases per 1k Population'],income_weath_distribution_popDensity_covid['Abbreviation']):
    plt.text(x = x, y = y-1, s = State,color = 'black',fontsize=12)
reg = LinearRegression().fit((income_weath_distribution_popDensity_covid['Upper Class %'].values).reshape(-1, 1), (income_weath_distribution_popDensity_covid['Cases per 1k Population'].values).reshape(-1, 1))
x_values = income_weath_distribution_popDensity_covid['Upper Class %'].values
plt.plot(x_values,x_values*(reg.coef_[0].item())+reg.intercept_)
plt.show()
plt.figure(figsize=(35,15))
sns.scatterplot(data=income_weath_distribution_popDensity_covid, x="Upper Class %", y="Deaths per 1k Population", hue="State",s=200)
plt.legend(bbox_to_anchor=(1.01, 1),borderaxespad=0)
plt.xlabel('Upper Class %', size = 20)
plt.ylabel('Deaths per 1k Population', size = 20)
plt.title('Upper Class % Vs Deaths per 1k Population', size = 30)
for x, y, State in zip(income_weath_distribution_popDensity_covid['Upper Class %'], income_weath_distribution_popDensity_covid['Deaths per 1k Population'],income_weath_distribution_popDensity_covid['Abbreviation']):
    plt.text(x = x, y = y, s = State,color = 'black',fontsize=12)
reg = LinearRegression().fit((income_weath_distribution_popDensity_covid['Upper Class %'].values).reshape(-1, 1), (income_weath_distribution_popDensity_covid['Deaths per 1k Population'].values).reshape(-1, 1))
x_values = income_weath_distribution_popDensity_covid['Upper Class %'].values
plt.plot(x_values,x_values*(reg.coef_[0].item())+reg.intercept_)
plt.show()
plt.figure(figsize=(35,15))
sns.scatterplot(data=income_weath_distribution_popDensity_covid, x="Middle Class %", y="Cases per 1k Population", hue="State",s=200)
plt.legend(bbox_to_anchor=(1.01, 1),borderaxespad=0)
plt.xlabel('Middle Class %', size = 20)
plt.ylabel('Cases per 1k Population', size = 20)
plt.title('Middle Class % Vs Cases per 1k Population', size = 30)
for x, y, State in zip(income_weath_distribution_popDensity_covid['Middle Class %'], income_weath_distribution_popDensity_covid['Cases per 1k Population'],income_weath_distribution_popDensity_covid['Abbreviation']):
    plt.text(x = x, y = y, s = State,color = 'black',fontsize=12)
reg = LinearRegression().fit((income_weath_distribution_popDensity_covid['Middle Class %'].values).reshape(-1, 1), (income_weath_distribution_popDensity_covid['Cases per 1k Population'].values).reshape(-1, 1))
x_values = income_weath_distribution_popDensity_covid['Middle Class %'].values
plt.plot(x_values,x_values*(reg.coef_[0].item())+reg.intercept_)
plt.show()
plt.figure(figsize=(35,15))
sns.scatterplot(data=income_weath_distribution_popDensity_covid, x="Middle Class %", y="Deaths per 1k Population", hue="State",s=200)
plt.legend(bbox_to_anchor=(1.01, 1),borderaxespad=0)
plt.xlabel('Middle Class %', size = 20)
plt.ylabel('Deaths per 1k Population', size = 20)
plt.title('Middle Class % Vs Deaths per 1k Population', size = 30)
for x, y, State in zip(income_weath_distribution_popDensity_covid['Middle Class %'], income_weath_distribution_popDensity_covid['Deaths per 1k Population'],income_weath_distribution_popDensity_covid['Abbreviation']):
    plt.text(x = x, y = y, s = State,color = 'black',fontsize=12)
reg = LinearRegression().fit((income_weath_distribution_popDensity_covid['Middle Class %'].values).reshape(-1, 1), (income_weath_distribution_popDensity_covid['Deaths per 1k Population'].values).reshape(-1, 1))
x_values = income_weath_distribution_popDensity_covid['Middle Class %'].values
plt.plot(x_values,x_values*(reg.coef_[0].item())+reg.intercept_)
plt.show()
plt.figure(figsize=(35,15))
sns.scatterplot(data=income_weath_distribution_popDensity_covid, x="Lower Class %", y="Cases per 1k Population", hue="State",s=200)
plt.legend(bbox_to_anchor=(1.01, 1),borderaxespad=0)
plt.xlabel('Lower Class %', size = 20)
plt.ylabel('Cases per 1k Population', size = 20)
plt.title('Lower Class % Vs Cases per 1k Population', size = 30)
for x, y, State in zip(income_weath_distribution_popDensity_covid['Lower Class %'], income_weath_distribution_popDensity_covid['Cases per 1k Population'],income_weath_distribution_popDensity_covid['Abbreviation']):
    plt.text(x = x, y = y, s = State,color = 'black',fontsize=12)
reg = LinearRegression().fit((income_weath_distribution_popDensity_covid['Lower Class %'].values).reshape(-1, 1), (income_weath_distribution_popDensity_covid['Cases per 1k Population'].values).reshape(-1, 1))
x_values = income_weath_distribution_popDensity_covid['Lower Class %'].values
plt.plot(x_values,x_values*(reg.coef_[0].item())+reg.intercept_)
plt.show()
plt.figure(figsize=(35,15))
sns.scatterplot(data=income_weath_distribution_popDensity_covid, x="Lower Class %", y="Deaths per 1k Population", hue="State",s=200)
plt.legend(bbox_to_anchor=(1.01, 1),borderaxespad=0)
plt.xlabel('Lower Class %', size = 20)
plt.ylabel('Deaths per 1k Population', size = 20)
plt.title('Lower Class % Vs Deaths per 1k Population', size = 30)
for x, y, State in zip(income_weath_distribution_popDensity_covid['Lower Class %'], income_weath_distribution_popDensity_covid['Deaths per 1k Population'],income_weath_distribution_popDensity_covid['Abbreviation']):
    plt.text(x = x, y = y, s = State,color = 'black',fontsize=12)
reg = LinearRegression().fit((income_weath_distribution_popDensity_covid['Lower Class %'].values).reshape(-1, 1), (income_weath_distribution_popDensity_covid['Deaths per 1k Population'].values).reshape(-1, 1))
x_values = income_weath_distribution_popDensity_covid['Lower Class %'].values
plt.plot(x_values,x_values*(reg.coef_[0].item())+reg.intercept_)
plt.show()
RESULTS Wealth Distribution VS Cases PER 1k POPULATION
After analyzing the graphs and adding regression lines to see the trends, we found that our prediction was only partly right. While states with a higher upper-class percentage did have fewer COVID-19 cases, they actually ended up having more COVID-19 deaths. This may be because wealthier people tend to be in the later stages of life, and COVID-19 is much harder to handle the older you get.
In conclusion, we found that some of the factors we tested are directly correlated with COVID-19 contraction and death, the most apparent being population size: the bigger a state's population, the more COVID-19 cases and deaths it has had. In the cases and deaths vs time graphs, we determined that June 2020 and the 2020 holiday season had the most COVID-19 cases and deaths. We did not find a correlation between the daily increase in cases and deaths in each state and the state's median income. Similarly, we did not find a correlation between the percentage of each class and cases and deaths per 1,000 people. Based on our graphs, we also could not find a correlation between population density and cases and deaths per 1,000 people. Population density would give more meaningful data if we looked at individual US cities, because many of the densely populated cities are in big states with lots of open land. Population is the greatest contributor to the number of COVID-19 cases and deaths a state has. It is such an overwhelming contributor that we had to normalize our data to cases and deaths per 1,000 people so that we could compare the states evenly.